Take-home Exercise 1: Demographic structures and distribution of Singapore in 2024

Author

Vanessa Riadi

Published

May 9, 2025

Modified

May 9, 2025

1 Overview

A local online media company that publishes daily content on digital platforms is planning to release an article on demographic structures and distribution of Singapore in 2024.

2 Objective

Assuming the role of the graphical editor of the media company, you are tasked to prepare at most three data visualisations for the article.

3 Analytical Toolkit: RStudio

RStudio and Quarto are used as the primary analytical toolkit for this project. The data is processed using appropriate tidyverse family of packages and the data visualisation prepared using ggplot2 and its extensions.

Before we get started, it is important for us to ensure that the required R packages have been installed.

Install pacman package

If you have yet to install pacman, install itby typing below in the Console:

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("pacman")

We then load the following R packages using the pacman::p_load() function:

- tidyverse, a family of modern R packages specially designed to support data science, analysis and communication task including creating static statistical graphs.
- patchwork for combining multiple ggplot2 graphs into one figure.
- plotly, R library for plotting interactive statistical graphs.
- ggrepel: a R package provides geoms for ggplot2 to repel overlapping text labels.
- ggthemes: a R package provides some extra themes, geoms, and scales for ggplot.
- hrbrthemes: a R package provides typography-centric themes and theme components for ggplot2.
- qreport: Provides statistical components, tables, and graphs. - ggiraph: for making β€˜ggplot’ graphics interactive.

pacman::p_load(tidyverse, patchwork,
               plotly, ggrepel,
               ggthemes, hrbrthemes, ggiraph, DT, qreport)

4 Data

Singapore Residents by Planning Area / Subzone, Single Year of Age and Sex, June 2024 dataset shared by Department of Statistics, Singapore(DOS)

4.1 Load the Data

First we load the data.

demographic_data <- read_csv("data/respopagesexfa2024.csv")

4.2 Check the Data

From the first glance, we notice that there are β€˜0’ Pop in the dataset. For this exercise, we will be focusing on the top-level visualization per Planning Areas and granular details like Population per Subzone and Floor Area aren’t necessary. We should exclude those rows with zero population values at the Planning Area level. This will help clean up the data and make the visualizations clearer by removing unnecessary zeros. We will do it in Chapter 4.3

Did you know?

There are more R packages that can help you to view or describe data. E.g. Hmisc, psych, qreport package. I will be using qreport here

If you have yet to install qreport, install it by typing below in the Console:

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("qreport")

Here I am using qreportpackage’s dataOverview that I already pre-load earlier.

dataOverview(demographic_data, plot = c("none"),)
demographic_data has 75696 observations (75696 complete) and 7 variables (7 complete)


|Variable |Type       | Distinct|  Info| Symmetry| NAs|Rarest Value            | Frequency of Rarest Value|Mode        | Frequency of Mode|
|:--------|:----------|--------:|-----:|--------:|---:|:-----------------------|-------------------------:|:-----------|-----------------:|
|PA       |Nonnumeric |       55| 0.999|    0.991|   0|Central Water Catchment |                       228|Bukit Merah |              3876|
|SZ       |Nonnumeric |      332| 1.000|    1.000|   0|Admiralty               |                       228|Admiralty   |               228|
|AG       |Discrete   |       19| 0.997|    1.000|   0|0_to_4                  |                      3984|0_to_4      |              3984|
|Sex      |Discrete   |        2| 0.750|    1.000|   0|Females                 |                     37848|Females     |             37848|
|FA       |Discrete   |        6| 0.972|    1.000|   0|<= 60                   |                     12616|<= 60       |             12616|
|Pop      |Continuous |      183| 0.831|    6.953|   0|1260                    |                         1|0           |             41742|
|Time     |Discrete   |        1| 0.000|    1.000|   0|2024                    |                     75696|2024        |             75696|

Let’s also count what’s the total Pop

cntpop <- demographic_data %>%
  summarise(Pop = sum(Pop, na.rm = TRUE)) 
cat(cntpop$Pop)
4187720

Observation

  • The data shows Singapore Residents by Planning Area / Subzone, Single Year of Age and Sex as of June 2024 with total population of 4,187,720.

  • We observe that there are 75,696 rows and 7 columns. No missing values are observed. Refer to the column legend in Appendix A

  • There are a total of seven attributes. 5 of them are categorical data type and the other three are in numerical data type.

    • The categorical attributes are: PA, SZ, AG, Sex, FA.
    • The numerical attributes are: Pop, Time.
  • We can also observe how many distinct values for each Variable. This will help us think what to use for our visualization.

4.3 Data Preparation

4.3.1 Cleaning Data

As mentioned earlier, we will be focusing on the top-level visualization per Planning Areas and granular details like Population per Subzone and Floor Area aren’t necessary. Here we will exclude those rows with zero population values at the Planning Area level by using filter.

Clean using filter and display as data table

demographic_data_clean <- demographic_data %>%
  filter(Pop > 0)

DT::datatable(demographic_data_clean , options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))

4.3.2 Age Group recoding

We will be recoding the Age Group to three groups for clearer visualization: Below 20 years, 20-64 Years, 65 years & Over following National Statistical Standards Recommendations on Definition and Classification of Age (More details in Appendix B ).

demographic_data_recode <- demographic_data_clean %>%
  mutate(
    AG_recode = recode(
      AG,
      "0_to_4" = "Below 20 Years",
      "5_to_9" = "Below 20 Years",
      "10_to_14" = "Below 20 Years",
      "15_to_19" = "Below 20 Years",
      "20_to_24" = "20-64 Years",
      "25_to_29" = "20-64 Years",
      "30_to_34" = "20-64 Years",
      "35_to_39" = "20-64 Years",
      "40_to_44" = "20-64 Years",
      "45_to_49" = "20-64 Years",
      "50_to_54" = "20-64 Years",
      "55_to_59" = "20-64 Years",
      "60_to_64" = "20-64 Years",
      "65_to_69" = "65 Years and Above",
      "70_to_74" = "65 Years and Above",
      "75_to_79" = "65 Years and Above",
      "80_to_84" = "65 Years and Above",
      "85_to_89" = "65 Years and Above",
      "90_and_over" = "65 Years and Above"
    )
  )

DT::datatable(demographic_data_recode , options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))

5 Data Visualisation, Observation, and Insights

5.1 Top 10 Planning Areas (PA) Ranked by Size of Resident Population (Pop)

top10PA <- demographic_data_clean %>%
  group_by(PA) %>%
  summarise(Pop = sum(Pop, na.rm = TRUE)) %>%
  slice_max(order_by = Pop, n = 10)

top10plot <- ggplot(data = top10PA, 
       aes(y = reorder(PA, Pop/1000), x = Pop/1000)) +  # reorder PA by Pop
  geom_col(show.legend = FALSE, fill = "pink4") +
  geom_text(aes(label = (Pop/1000)),
            hjust = -0.2, color = "black", size = 3) +
  ggtitle("Top 10 Planning Areas in 2024\nRanked by Size of Singapore Resident Population",
  subtitle = paste("Total resident population:", 
                 format(round(cntpop$Pop / 1000, 2), big.mark = ","), 
                 "thousand")) +
  labs(
    y = NULL,
    x = "Resident Population\nin thousands (β€˜000)",
    caption = "Source: singstat.gov.sg"
  ) +
  theme_ipsum(base_family = "Arial",
  plot_title_size = 14,
  subtitle_size = 10,
  caption_size = 8,
  plot_title_face = "bold",
  caption_face = "italic",
  grid = "",
  axis_title_face = "bold",
  axis_title_size = 11) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_text(size=11, face="bold"),
        axis.title.x = element_text(hjust = 0.5)
        )+
  scale_x_continuous(expand = expansion(mult = c(0, 0.1)))

top10plot
# Total population of the top 10 Planning Area
cntpoptop10 <- top10PA %>%
  summarise(Pop = sum(Pop, na.rm = TRUE)) 
cat(cntpoptop10$Pop)
2358550
# The percentage of the population of the top 10 most populous Planning Area relative to the total population.
perc_pop_top10 <- (cntpoptop10 / cntpop) * 100
cat(perc_pop_top10$Pop)
56.32062

πŸŽ€ Insights Plot 1

  • Slightly over half (56.3%) of the 4,187.72 thousand (~4.19 million) residents in Singapore stayed in the top 10 planning areas of residence.

  • There were five planning areas with more than 250,000 residents each, namely Tampines, Bedok, Sengkang, Jurong West, and Woodlands.

  • Tampines was the most populated with 284,720 residents.

5.3 Age Group and Sex distribution

# Define age group levels and labels
AG_levels <- c(
  "0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24",
  "25_to_29", "30_to_34", "35_to_39", "40_to_44", "45_to_49",
  "50_to_54", "55_to_59", "60_to_64", "65_to_69", "70_to_74",
  "75_to_79", "80_to_84", "85_to_89", "90_and_over"
)

AG_labels <- c(
  "0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39",
  "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79",
  "80-84", "85-89", ">90"
)

AG_lookup <- setNames(AG_labels, AG_levels)

# Compute population shares by age group
AGsum2 <- demographic_data_clean %>%
  group_by(AG) %>%
  summarise(Pop = sum(Pop, na.rm = TRUE), .groups = "drop") %>%
  mutate(
    AG = factor(AG, levels = AG_levels),
    Pop_share = Pop / sum(Pop),
    cum_share = cumsum(Pop_share),
    percentile = round(cum_share * 100, 1)
  ) %>%
  arrange(AG)

# Median and Q3 age groups
median_AG <- AGsum2 %>% filter(cumsum(Pop) >= sum(Pop) / 2) %>% slice(1) %>% pull(AG)
q3_AG     <- AGsum2 %>% filter(cumsum(Pop) >= sum(Pop) * 0.75) %>% slice(1) %>% pull(AG)

# Convert median and Q3 to labels
median_label <- AG_lookup[as.character(median_AG)]
q3_label     <- AG_lookup[as.character(q3_AG)]

# Prepare population pyramid data
pyramid_data <- demographic_data_clean %>%
  group_by(AG, Sex) %>%
  summarise(Population = sum(Pop, na.rm = TRUE) / 1000, .groups = "drop") %>%
  mutate(
    AG = factor(AG, levels = AG_levels, labels = AG_labels, ordered = TRUE),
    PopPercentage = ifelse(Sex == "Females", -Population, Population) / sum(Population) * 100,
    PopPercentage = round(PopPercentage, 2),
    Signal = ifelse(Sex == "Females", -1, 1)
  )


# Create the population pyramid plot
pyramid_plot <- ggplot(pyramid_data, aes(x = AG, y = PopPercentage, fill = Sex)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = PopPercentage + Signal * 0.5, label = abs(PopPercentage)),
            size = 3, color = "black") +
  coord_flip() +
  scale_fill_manual(values = c("Females" = "pink2", "Males" = "steelblue"),
                    guide = guide_legend(override.aes = list(fill = NA))) +
  scale_y_continuous(labels = abs) +
  annotate("segment", x = median_label, xend = median_label, y = -4, yend = 3.6,
           color = "red4", linewidth = 0.7, linetype = "dotted") +
  annotate("text", x = median_label, y = 5, label = "Median",
           color = "red4", size = 2.8, fontface = "bold") +
  annotate("segment", x = q3_label, xend = q3_label, y = -3.65, yend = 3.65,
           color = "red4", linewidth = 0.7, linetype = "dotted") +
  annotate("text", x = q3_label, y = 5, label = "Q3",
           color = "red4", size = 2.8, fontface = "bold") +
  ggtitle("Population Pyramid of Singapore Residents\nby Age and Sex 2024",
          subtitle = "in percentage (%)") +
  labs(
    y = "Population (%)",
    x = "Age Group",
    fill = "Sex",
    caption = "Source: singstat.gov.sg"
  ) +
  theme_ipsum(base_family = "Arial",
              plot_title_size = 14,
              subtitle_size = 10,
              caption_size = 8,
              plot_title_face = "bold",
              caption_face = "italic",
              grid = "Y",
              axis_title_face = "bold",
              axis_title_size = 11,
              axis_text_size = 8) +
  theme(
    strip.text = element_text(face = "bold"),
    axis.title.x = element_text(hjust = 0.5),
    axis.title.y = element_text(hjust = 0.5),
    legend.position = "top",
    legend.title = element_blank(),
    legend.justification = c(0.45, 0),
    legend.margin = margin(t = -20, r = 0, b = -10, l = 0, unit = "pt")
  )

pyramid_plot

Alternatively for more simpler visualisation we can also use this plot, though we can only infer Age Distribution Insights from this plot.

# Details of the population pyramid distribution
print(pyramid_data, n = 38)
# A tibble: 38 Γ— 5
   AG    Sex     Population PopPercentage Signal
   <ord> <chr>        <dbl>         <dbl>  <dbl>
 1 0-4   Females      83.4          -1.99     -1
 2 0-4   Males        87.5           2.09      1
 3 10-14 Females     100.           -2.4      -1
 4 10-14 Males       104.            2.49      1
 5 15-19 Females     104.           -2.49     -1
 6 15-19 Males       107.            2.56      1
 7 20-24 Females     110.           -2.64     -1
 8 20-24 Males       115.            2.74      1
 9 25-29 Females     135.           -3.22     -1
10 25-29 Males       135.            3.23      1
11 30-34 Females     166.           -3.95     -1
12 30-34 Males       156.            3.71      1
13 35-39 Females     166            -3.96     -1
14 35-39 Males       149.            3.56      1
15 40-44 Females     164.           -3.92     -1
16 40-44 Males       147.            3.5       1
17 45-49 Females     158.           -3.77     -1
18 45-49 Males       144.            3.44      1
19 50-54 Females     158.           -3.78     -1
20 50-54 Males       149.            3.56      1
21 55-59 Females     150.           -3.58     -1
22 55-59 Males       145.            3.46      1
23 5-9   Females      98.9          -2.36     -1
24 5-9   Males       104.            2.47      1
25 60-64 Females     150.           -3.57     -1
26 60-64 Males       147.            3.52      1
27 65-69 Females     136.           -3.25     -1
28 65-69 Males       130.            3.12      1
29 70-74 Females     107.           -2.57     -1
30 70-74 Males        99.3           2.37      1
31 75-79 Females      72.9          -1.74     -1
32 75-79 Males        61.9           1.48      1
33 80-84 Females      45.1          -1.08     -1
34 80-84 Males        32.7           0.78      1
35 85-89 Females      27.2          -0.65     -1
36 85-89 Males        16.9           0.4       1
37 >90   Females      17.4          -0.42     -1
38 >90   Males         7.73          0.18      1
# Details of the AG distribution
AGsum2
# A tibble: 19 Γ— 5
   AG             Pop Pop_share cum_share percentile
   <fct>        <dbl>     <dbl>     <dbl>      <dbl>
 1 0_to_4      170930   0.0408     0.0408        4.1
 2 5_to_9      202420   0.0483     0.0892        8.9
 3 10_to_14    204610   0.0489     0.138        13.8
 4 15_to_19    211560   0.0505     0.189        18.9
 5 20_to_24    225020   0.0537     0.242        24.2
 6 25_to_29    270090   0.0645     0.307        30.7
 7 30_to_34    321010   0.0767     0.383        38.3
 8 35_to_39    315180   0.0753     0.459        45.9
 9 40_to_44    310700   0.0742     0.533        53.3
10 45_to_49    301820   0.0721     0.605        60.5
11 50_to_54    307760   0.0735     0.678        67.8
12 55_to_59    294500   0.0703     0.749        74.9
13 60_to_64    297020   0.0709     0.820        82  
14 65_to_69    266580   0.0637     0.883        88.3
15 70_to_74    206760   0.0494     0.933        93.3
16 75_to_79    134810   0.0322     0.965        96.5
17 80_to_84     77750   0.0186     0.983        98.3
18 85_to_89     44050   0.0105     0.994        99.4
19 90_and_over  25150   0.00601    1           100  

πŸŽ€ Insights Plot 2

  • The median age falls within 40 to 44 age group. About 25% are aged 60 and above, with 18% aged 65 and older, highlighting aging population trend.
  • Children (0–14) account for only about 13.8% of the population, indicating low birth rates. Slightly more males than females are observed in this group.
  • The working-age population (15–64 years), as defined by the Ministry of Manpower, comprises around 68.2% of the population. This reflects a strong labour force, though future demographic challenges may arise as this group continues to age.
  • The 25-64 age group shows a near-equal gender balance.
  • The gender gap widens in the older age cohorts, with majority female in the 80+ age groups. This show that females live longer than males on average, consistent with the life expectancy at birth between the different gendersfrom 2023 report by the Singapore Department of Statistics.

5.3 Association Between Age Group Distribution and Planning Area in Singapore

table_data <- table(demographic_data_recode$AG_recode, demographic_data_recode$PA)

chi_test <- chisq.test(table_data)

residuals_df <- as.data.frame(as.table(chi_test$residuals))

AG_levels <- c("Below 20 Years", "20-64 Years", "65 Years and Above")

AG_labels <- c("Below 20 Years", "20-64 Years", "65 Years and Above")

residuals_df <- residuals_df %>%
  mutate(Var1 = factor(Var1, levels = AG_levels, labels = AG_labels, ordered = TRUE))

ggplot(residuals_df, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = round(Freq, 2)), size = 4) +
  scale_fill_gradient2(low = "steelblue", high = "red3", mid = "white", midpoint = 0) +
  labs(x = "AG", y = "PA", fill = "Residuals") +
  ggtitle("Association Between Age Group\nand Planning Area in Singapore 2024",
          subtitle = "Meassured using Pearson's Chi-squared test") +
  labs(caption = "Source: singstat.gov.sg")+
  theme_ipsum(base_family = "Arial",
              plot_title_size = 14,
              subtitle_size = 10,
              caption_size = 8,
              plot_title_face = "bold",
              caption_face = "italic",
              grid = "Y",
              axis_title_face = "bold",
              axis_title_size = 11,
              axis_text_size = 8) +
  theme(
    strip.text = element_text(face = "bold"),
    axis.text.x = element_text(size=10, face="bold", angle = -45, hjust = 0),
    axis.text.y = element_text(size=10),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_blank(),
    legend.justification = c(0.45, 0),
    legend.margin = margin(t = -20, r = 0, b = -10, l = 0, unit = "pt"))+
  scale_x_discrete(expand = expansion(add = 0.5)) +
  scale_y_discrete(expand = expansion(add = 0.5)) 

πŸŽ€ Insights Plot 3

  • Central Areas like Downtown Core are skewed toward working-age group with strong over-representation of the 20–64 group (+7.46) and lesser children (–4.03) and seniors (–6.69), reflecting its role as a Central Business District. Similar patterns appear in Newton, Museum (Bras Basah, Dhoby Ghaut, Fort Canning), and Singapore River (Boat Quay, Clarke Quay, Robertson Quay).

  • New BTO areas like Tengah and Punggol show more children and lesser seniors, indicating young families moving into new Towns / Planning Areas.

  • Mature Towns such as Bedok, Ang Mo Kio, and Hougang exhibit neutral residuals, suggesting stable populations with a mix of age groups and long-term residents.

  • Affluent Districts such as Bukit Timah, we observe more children (+1.98) and 20-65 group (+2.99) with lesser seniors (-5.75). Proximity to β€œgood schools” for kiasu parents likely attracts young wealthy families with its abundance of private housing. Similar trends observed in Tanglin.

6 Reference

- ggplot for categorical-data
- Describe function
- gt package
- theme for ggplot2
- Recode Values with dplyr
- Customize tick marks and labels
- National Statistical Standards Recommendations on Definition and Classification of Age
- Cencus of Population 2020
- Population Pyramid Plot
- Ageing Population
- Heatmap ggplot2
- Chi Square in r

7 Appendix

7.1 Appendix A

Column Name Description
PA Planning Area
SZ Subzone
AG Age Group
Sex Sex
FA Floor Area of Residence
Pop Resident Count (Population)
Time Time / Period

7.2 Appendix B

National Statistical Standards Recommendations on Definition and Classification of Age

AG AG_recode
0_to_4 Below 20 years
5_to_9 Below 20 years
10_to_14 Below 20 years
15_to_19 Below 20 years
20_to_24 20-64 Years
25_to_29 20-64 Years
30_to_34 20-64 Years
35_to_39 20-64 Years
40_to_44 20-64 Years
45_to_49 20-64 Years
50_to_54 20-64 Years
55_to_59 20-64 Years
60_to_64 20-64 Years
65_to_69 65 years & Over
70_to_74 65 years & Over
75_to_79 65 years & Over
80_to_84 65 years & Over
85_to_89 65 years & Over
90_and_over 65 years & Over